Korean Part-of-speech Tagging Based on a Hidden Markov Model

Authors

  • Jae-Hoon Kim
  • Chul-Su Lim
  • Jungyun Seo
Abstract

In this paper, we describe a method for assigning a Korean part-of-speech tag to each morpheme. The method is based on a hidden Markov model that can be trained without any tagged corpus. To reduce the computation needed to process multiple observation sequences, which arise to an unusual degree in Korean part-of-speech tagging, we develop a revised Viterbi algorithm for determining the most promising tag sequence. Experimental results show that the accuracy of the model is approximately 90% on average. This performance is no better than that of English tagging systems, owing to the partially free word order of Korean and the lack of training data. However, to the best of our knowledge, this is the first Korean POS tagging system that can be trained without a tagged corpus.

Many words have ambiguous part-of-speech (POS) tags. For example, 'time' in English can be a noun, a verb, or an adjective. In many cases, such ambiguities can be resolved using contextual information. For example, in "Time is money.", the word 'time' can be determined to be a noun. The assignment of the correct POS tag to each word in a text is called part-of-speech tagging. A part-of-speech tagger is a system that selects the most appropriate POS tag for each word using contextual information.

POS tagging has been treated by several approaches: rule-based approaches, statistical approaches, neural network approaches, and so on. We deal with the POS tagging problem by a statistical method described in terms of a Markov model. Hidden Markov modeling permits us to compute the most probable sequence of state transitions, which is the most likely sequence of POS tags for a given sentence.

POS tagging in Korean differs from POS tagging in English. In Korean, most word phrases consist of more than one morpheme. In order to assign POS tags to the morphemes in each word phrase, we must morphologically analyze the word phrase in advance. One word phrase, however, may be analyzed in several different ways due to lexical ambiguities. Furthermore, each analyzed result may consist of a different number of morphemes. POS tagging in Korean can be done by the unit of a word phrase [12]. However, word phrase-based tagging has some problems, including the following: Because all possible tags for word phrases cannot be predicted in advance, whenever a new tag for a word phrase is …
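
As a rough illustration of how a hidden Markov model yields the most likely tag sequence, the sketch below runs the standard textbook Viterbi algorithm on the "Time is money." example. It is not the revised Viterbi algorithm the authors develop for multiple observation sequences, and it works on English words, not Korean morphemes. The function name `viterbi`, the two-tag set, and all probability values are assumptions invented for this example.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Standard first-order Viterbi decoding; returns (best_path, log_prob)."""

    def log(p):
        # Treat zero probability as log(0) = -inf instead of raising an error.
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: best log-probability of any tag path ending in state s at time t.
    best = [{s: log(start_p[s]) + log(emit_p[s].get(observations[0], 0.0))
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path

    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log(emit_p[s].get(observations[t], 0.0))
            back[t][s] = prev

    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Toy model: every number below is made up for illustration only.
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.7, "VERB": 0.3}
trans_p = {
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.8, "VERB": 0.2},
}
emit_p = {
    "NOUN": {"time": 0.5, "money": 0.5},
    "VERB": {"time": 0.1, "is": 0.9},
}

path, logp = viterbi(["time", "is", "money"], states, start_p, trans_p, emit_p)
print(path)  # ['NOUN', 'VERB', 'NOUN']
```

With these toy numbers the ambiguous word 'time' is tagged as a noun because the noun reading has the higher combined start and emission probability and leads into the only tag that can emit 'is'. Log-probabilities are used so that long sentences do not underflow.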

Similar resources

Persian Part-of-Speech Tagging Using a Fuzzy Network Model

Part-of-speech tagging (POS tagging) is an ongoing research topic in natural language processing (NLP) applications. The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The purpose of POS tagging is determining the grammatical ...

TAKTAG: Two-phase learning method for hybrid statistical/rule-based part-of-speech disambiguation

Both statistical and rule-based approaches to part-of-speech (POS) disambiguation have their own advantages and limitations. Especially for Korean, the narrow windows provided by a hidden Markov model (HMM) cannot cover the necessary lexical and long-distance dependencies for POS disambiguation. On the other hand, the rule-based approaches are not accurate and flexible to new tag-sets and language...

Automatic Word Spacing Using Hidden Markov Model for Refining Korean Text Corpora

This paper proposes a word spacing model using a hidden Markov model (HMM) for refining Korean raw text corpora. Previous statistical approaches for automatic word spacing have used models that make use of inaccurate probabilities because they do not consider the previous spacing state. We consider the word spacing problem as a classification problem such as Part-of-Speech (POS) tagging and have expe...

Speech enhancement based on hidden Markov model using sparse code shrinkage

This paper presents a new hidden Markov model-based (HMM-based) speech enhancement framework based on independent component analysis (ICA). We propose analytical procedures for training clean speech and noise models by the Baum re-estimation algorithm and present a maximum a posteriori (MAP) estimator based on a Laplace-Gaussian (for clean speech and noise respectively) combination in the HMM ...

Publication date: 1997